In this study regional data of Helsinki is combined with 2015 parliamentary election results. Existence of green-red bubble is statistically confirmed by using Principal component analysis. The bubble is identified to regions where significantly high share of population consist of 18-44 year olds and women. In the second research question left-wing party Vasemmistoliitto’s support is inspected by using linear discriminant analysis. We conclude that support is not easy to spesify to certain characteristics as Vasemmistoliitto appears to have a diverse voter base split between young voters and older ones. Prediction with LDA model achieves reasonable performance.
1st Research Question: Is there a green-red voter bubble in Helsinki?
2nd Research Question: What regional characteristics predict, left-wing party, Vasemmistoliitto’s support?
To answer these questions I have built a dataset that consists of regional open data from Aluesarjat and Tilastokeskus. Constructing the dataset required couple thousand lines of R code. It would not be fair to put someone to evaluate that amount of R code, so I chose only to provide one file to illustrate what enormous amount of work the data wrangling was. In the wrangling script I introduced regional data which was later included into the final dataset. You can see the data wrangling file here: Data wrangling. I will not discuss the details of the dataset, but the most effort was needed to combine regional data to voting districts.
Before we get started with the analysis I want to introduce two Shiny applications which I programmed to visualize the data. Please take a look at Helsinki clusters app and Helsinki parties app. I used leaflet and plotlyto create dynamic web applications. The apps are still in development and I will add more features to them as I have more time.
Here is the list of variables which are included in the analysis:
aanestysalue_nro = Number of the voting district (may include several)aanestysalue_nimi = Name of the voting district (may include several)Kuluyks_keskiarvo2014 = Average income of a person living in the regionKuluyks_mediaani2014 = Median income of a person living in the regionGini2014 = Gini coefficient for the regionIP.p to VIHR.p = Vote share of a party in the region (IP = Itsenäisyyspuolue, VIHR = Vihreät)Punavih.p = Vote share of green-reds (Vasemmistoliitto and Vihreät, VAS.p + VIHR.p)Naisia_2015.p = Share of women living in the regionRuotsinkielisia_2015.p = Share of Swedish speaking people in the regionKorkeakoulutus2015.p = Share of higher education of the people living in the regionTyottomat2014.p = Unemployment percentage for the regionSuurituloisinX_2014.p = Share of people in the region belonging to the highest 10% income decileAllemediaanitulot_2014.p = Share of people in the region belonging to the income deciles bellow median incomeNolla_17.p to Yli_65.p = Share of people having age of 0-17 in the region; other age groups are 18-29, 30-44, 45-64 and 65-Suurinpuolue = Name of the party that has the largest support in the regionlibrary(xlsx)
library(dplyr)
library(shiny)
df <- read.xlsx2("helsinki_election_2015.xlsx", startRow = 1, endRow = 106, sheetIndex = 1, header = T)
df <- df %>%
dplyr::select(aanestysalue_nro, aanestysalue_nimi, Kuluyks_keskiarvo2014, Kuluyks_mediaani2014,Gini2014,IP.p:Punavih.p,Naisia_2015.p:Yli_65.p,Suurinpuolue)
min <- which(colnames(df) == "Kuluyks_keskiarvo2014")
max <- which(colnames(df) == "Kuluyks_mediaani2014")
df[,min:max] <- df[,min:max] %>%
lapply(., function(x){as.numeric(as.character(x))})
min <- which(colnames(df) == "Gini2014")
max <- which(colnames(df) == "Yli_65.p")
for(x in min:max){
df[,x] <- as.numeric(as.character(sub(",", ".", df[,x])))
}
str(df)
## 'data.frame': 105 obs. of 34 variables:
## $ aanestysalue_nro : Factor w/ 105 levels "10A, 10B","10C",..: 18 24 25 27 29 30 35 36 37 43 ...
## $ aanestysalue_nimi : Factor w/ 105 levels "Ala-Malmi A",..: 13 50 51 96 7 9 56 57 58 27 ...
## $ Kuluyks_keskiarvo2014 : num 25494 24541 26016 30031 26849 ...
## $ Kuluyks_mediaani2014 : num 23771 22092 23296 26456 24304 ...
## $ Gini2014 : num 0.258 0.244 0.26 0.277 0.286 ...
## $ IP.p : num 0.005 0.004 0.003 0.005 0.004 0.001 0 0.004 0.005 0.001 ...
## $ KA.p : num 0.001 0.001 0.002 0.001 0 0.001 0 0.001 0 0.001 ...
## $ KD.p : num 0.017 0.018 0.016 0.021 0.024 0.024 0.012 0.019 0.014 0.026 ...
## $ KESK.p : num 0.05 0.074 0.066 0.083 0.086 0.095 0.073 0.069 0.084 0.076 ...
## $ KOK.p : num 0.134 0.159 0.2 0.239 0.278 0.303 0.526 0.304 0.324 0.226 ...
## $ KTP.p : num 0.001 0 0 0 0 0 0 0 0 0 ...
## $ Muut.p : num 0.003 0.002 0.005 0.002 0.001 0.001 0 0.002 0.002 0.001 ...
## $ Muutos2011.p : num 0.001 0.002 0.002 0.001 0.002 0 0 0 0.001 0.004 ...
## $ Piraattip..p : num 0.022 0.016 0.015 0.011 0.018 0.013 0.01 0.012 0.015 0.016 ...
## $ PS.p : num 0.101 0.152 0.129 0.14 0.076 0.092 0.051 0.078 0.076 0.14 ...
## $ RKP.p : num 0.082 0.054 0.042 0.048 0.086 0.078 0.127 0.116 0.085 0.079 ...
## $ SDP.p : num 0.147 0.193 0.19 0.169 0.11 0.103 0.051 0.125 0.114 0.204 ...
## $ SKP.p : num 0.005 0.016 0.014 0.007 0.004 0.002 0 0.004 0.003 0.003 ...
## $ STP.p : num 0.001 0.001 0.001 0 0 0 0 0 0 0.001 ...
## $ VAS.p : num 0.191 0.125 0.125 0.104 0.086 0.076 0.03 0.067 0.058 0.087 ...
## $ VIHR.p : num 0.241 0.183 0.192 0.169 0.224 0.209 0.12 0.2 0.215 0.134 ...
## $ Punavih.p : num 0.431 0.308 0.317 0.273 0.31 0.286 0.15 0.267 0.274 0.221 ...
## $ Naisia_2015.p : num 0.533 0.532 0.527 0.527 0.551 ...
## $ Ruotsinkielisia_2015.p : num 0.1066 0.0404 0.0416 0.0292 0.0832 ...
## $ Korkeakoulutus2015.p : num 0.378 0.309 0.33 0.407 0.452 ...
## $ Tyottomat2014.p : num 0.0972 0.1579 0.1453 0.1229 0.0956 ...
## $ SuurituloisinX_2014.p : num 0.0689 0.0603 0.0945 0.1322 0.0909 ...
## $ Allemediaanitulot_2014.p: num 0.44 0.495 0.458 0.365 0.382 ...
## $ Nolla_17.p : num 0.129 0.154 0.157 0.218 0.116 ...
## $ Kahdeksantoista_29.p : num 0.275 0.151 0.162 0.148 0.251 ...
## $ Kolmekymmenta_44.p : num 0.264 0.193 0.192 0.224 0.253 ...
## $ Neljakymmentaviisi_64.p : num 0.205 0.277 0.277 0.277 0.213 ...
## $ Yli_65.p : num 0.127 0.225 0.212 0.134 0.166 ...
## $ Suurinpuolue : Factor w/ 4 levels "KOK","PS","SDP",..: 4 3 1 1 1 1 1 1 1 1 ...
summary(df)
## aanestysalue_nro aanestysalue_nimi
## 10A, 10B : 1 Ala-Malmi A : 1
## 10C : 1 Ala-Malmi B : 1
## 11A : 1 Alppila A, Alppila B : 1
## 11B, 11C, 11D: 1 Alppila C, Alppila D : 1
## 11E, 11F : 1 Eira : 1
## 12A, 12B : 1 Etelä-Haaga A, Etelä-Haaga B: 1
## (Other) :99 (Other) :99
## Kuluyks_keskiarvo2014 Kuluyks_mediaani2014 Gini2014
## Min. : 21005 Min. :19565 Min. :0.1920
## 1st Qu.: 24801 1st Qu.:22578 1st Qu.:0.2505
## Median : 27927 Median :25032 Median :0.2712
## Mean : 30885 Mean :26012 Mean :0.2943
## 3rd Qu.: 32669 3rd Qu.:28523 3rd Qu.:0.3000
## Max. :100732 Max. :42395 Max. :0.6305
##
## IP.p KA.p KD.p KESK.p
## Min. :0.000000 Min. :0.000000 Min. :0.00600 Min. :0.03600
## 1st Qu.:0.002200 1st Qu.:0.000500 1st Qu.:0.01350 1st Qu.:0.05800
## Median :0.003200 Median :0.000900 Median :0.01820 Median :0.07460
## Mean :0.003497 Mean :0.000981 Mean :0.01822 Mean :0.07412
## 3rd Qu.:0.004400 3rd Qu.:0.001300 3rd Qu.:0.02380 3rd Qu.:0.08940
## Max. :0.010900 Max. :0.004000 Max. :0.03100 Max. :0.13300
##
## KOK.p KTP.p Muut.p
## Min. :0.0891 Min. :0.0000000 Min. :0.000000
## 1st Qu.:0.1632 1st Qu.:0.0000000 1st Qu.:0.001500
## Median :0.2420 Median :0.0000000 Median :0.002100
## Mean :0.2568 Mean :0.0001867 Mean :0.002332
## 3rd Qu.:0.3240 3rd Qu.:0.0002000 3rd Qu.:0.003000
## Max. :0.5260 Max. :0.0031000 Max. :0.009500
##
## Muutos2011.p Piraattip..p PS.p RKP.p
## Min. :0.000000 Min. :0.00400 Min. :0.0345 Min. :0.01500
## 1st Qu.:0.000700 1st Qu.:0.01020 1st Qu.:0.0760 1st Qu.:0.03610
## Median :0.001100 Median :0.01300 Median :0.1130 Median :0.04800
## Mean :0.001374 Mean :0.01416 Mean :0.1216 Mean :0.06357
## 3rd Qu.:0.002000 3rd Qu.:0.01700 3rd Qu.:0.1603 3rd Qu.:0.07800
## Max. :0.006200 Max. :0.03260 Max. :0.2950 Max. :0.23300
##
## SDP.p SKP.p STP.p VAS.p
## Min. :0.0510 Min. :0.00000 Min. :0.000000 Min. :0.02100
## 1st Qu.:0.1161 1st Qu.:0.00250 1st Qu.:0.000000 1st Qu.:0.06810
## Median :0.1630 Median :0.00360 Median :0.000500 Median :0.08930
## Mean :0.1603 Mean :0.00431 Mean :0.000699 Mean :0.09724
## 3rd Qu.:0.2040 3rd Qu.:0.00540 3rd Qu.:0.001000 3rd Qu.:0.11490
## Max. :0.2984 Max. :0.01600 Max. :0.004600 Max. :0.24090
##
## VIHR.p Punavih.p Naisia_2015.p Ruotsinkielisia_2015.p
## Min. :0.0760 Min. :0.1010 Min. :0.4756 Min. :0.01530
## 1st Qu.:0.1338 1st Qu.:0.2140 1st Qu.:0.5172 1st Qu.:0.03080
## Median :0.1690 Median :0.2628 Median :0.5285 Median :0.04020
## Mean :0.1806 Mean :0.2779 Mean :0.5291 Mean :0.05786
## 3rd Qu.:0.2240 3rd Qu.:0.3222 3rd Qu.:0.5397 3rd Qu.:0.07070
## Max. :0.3257 Max. :0.5411 Max. :0.5758 Max. :0.21980
##
## Korkeakoulutus2015.p Tyottomat2014.p SuurituloisinX_2014.p
## Min. :0.1274 Min. :0.0230 Min. :0.0053
## 1st Qu.:0.3117 1st Qu.:0.0822 1st Qu.:0.0526
## Median :0.3999 Median :0.1030 Median :0.0921
## Mean :0.3959 Mean :0.1101 Mean :0.1187
## 3rd Qu.:0.4676 3rd Qu.:0.1343 3rd Qu.:0.1473
## Max. :0.6044 Max. :0.2110 Max. :0.4240
##
## Allemediaanitulot_2014.p Nolla_17.p Kahdeksantoista_29.p
## Min. :0.1291 Min. :0.0361 Min. :0.0836
## 1st Qu.:0.2672 1st Qu.:0.1353 1st Qu.:0.1446
## Median :0.3816 Median :0.1700 Median :0.1821
## Mean :0.3820 Mean :0.1714 Mean :0.1904
## 3rd Qu.:0.4662 3rd Qu.:0.2055 3rd Qu.:0.2235
## Max. :0.6161 Max. :0.3214 Max. :0.3868
##
## Kolmekymmenta_44.p Neljakymmentaviisi_64.p Yli_65.p Suurinpuolue
## Min. :0.1498 Min. :0.1720 Min. :0.0046 KOK :58
## 1st Qu.:0.1890 1st Qu.:0.2172 1st Qu.:0.1352 PS : 2
## Median :0.2159 Median :0.2576 Median :0.1658 SDP :24
## Mean :0.2248 Mean :0.2512 Mean :0.1622 VIHR:21
## 3rd Qu.:0.2532 3rd Qu.:0.2777 3rd Qu.:0.1907
## Max. :0.3536 Max. :0.3411 Max. :0.2568
##
I strongly reccommed using my Shiny applications to explore the data, but for a quick overview, summary produces sufficient information. Kokoomus has the biggest share of support in Helsinki, SDP is the second by the ammount of won regions and Vihreät third, but Vihreät has higher median in vote share than SDP. So we can say that SDP’s support is more concentrated. There are big differences in the proportion of age groups, wealth, higher education, share of Swedish speaking people between the regions.
Let’s continue with the exploration of the data by doing correlation matrix with ggpairs function. Unluckily we have so many variables (34 of which 31 is used for analysis) that we cannot visualize them all in one plot and by splitting the columns into two parts does not include all combinations of variables.
library(GGally)
library(ggplot2)
my_custom_cor_color <- function(data, mapping, color = I("black"), sizeRange = c(1, 5), ...) {
# get the x and y data to use the other code
x <- eval(mapping$x, data)
y <- eval(mapping$y, data)
ct <- cor.test(x,y)
r <- unname(ct$estimate)
rt <- format(r, digits=2)[1]
tt <- as.character(rt)
# plot the cor value
p <- ggally_text(
label = tt,
mapping = aes(),
xP = 0.5, yP = 0.5,
size = 6,
color=color,
...
) +
theme(
panel.background=element_rect(fill="white", color = "black"),
panel.grid.minor=element_blank(),
panel.grid.major=element_blank()
)
corColors <- RColorBrewer::brewer.pal(n = 7, name = "RdYlGn")[2:6]
if (r <= -0.8) {
corCol <- corColors[1]
} else if (r <= -0.6) {
corCol <- corColors[2]
} else if (r < 0.6) {
corCol <- corColors[3]
} else if (r < 0.8) {
corCol <- corColors[4]
} else {
corCol <- corColors[5]
}
p <- p + theme(
panel.background = element_rect(fill= corCol)
)
p
}
# sadly this hack does not work anymore to make the colors customly
myColors <- c("#0066FF","#4D4D4D","#FF3333", "#00B33C")
names(myColors) <- levels(df$Suurinpuolue)
colScale <- scale_colour_manual(name = "Suurinpuolue", values = myColors)
ggplot <- function(...) ggplot2::ggplot(...) + colScale
ggpairs(
df[,3:length(df)],
columns = 1:17, ## tata pitaa sit modaa
mapping = ggplot2::aes(color = Suurinpuolue),
upper = list(continuous = my_custom_cor_color),
diag = list(combo = colScale),
lower = list(
combo = colScale
)
)
ggplot <- function(...) ggplot2::ggplot(...) + colScale
ggpairs(
df[,3:length(df)],
columns = 18:31, ## tata pitaa sit modaa
mapping = ggplot2::aes(color = Suurinpuolue),
upper = list(continuous = my_custom_cor_color),
diag = list(combo = colScale),
lower = list(
combo = colScale
)
)
I tried to custom set the colours for ggpairs plots, as the data points are separated two four groups based on Suurinpuolue variable, but this proved not to be possible.
We can observe from the later correlation matrix that Punavih.p correlates strongly with young age, slightly positively with share of women, slightly negatively with high education and more negatively with share of people in the highest income decile. We can easily detect that VIHR.p has a mild positive correlation with high education (0.28), but because VAS.p has a stronger negative one the correlation between Punavih.p and higher education is around zero.
I am seeking an answer to the research question: Is there a green-red voter bubble in Helsinki?
To answer this I chose my statistical method to be PCA (principal component analysis) which suits well for multidimensional data, as PCA reduces the dimensions of the data to its main components. PCA seeks to find surfaces (in the data space) into which the projected data achieves the highest amount of variance. This ensures that the least amount of information of the data gets being lost.
PCA’s biplot illustration makes it possible to visualize connections between variables, components and the data. Our task is to visually inspect which variables contribute to which directions and what kind of interplay of variables can we determine.
My hypothesis is that there is a green-red bubble in which includes Kallio, Sörnäinen and Vallila regions. To validate the hypothesis we should expect to see variables from both Vasemmistoliitto and Vihreät to point towards the same direction. But more importantly their joint variable VAS+VIHR to point point towards the data points of Kallio, Sörnäinen and Vallilla.
To perform the analysis we need to scale the variables first. This requires that we get rid of all non-numeric variables. Then I will rename the variables for readability of the biplot. Let’s take a look at the newly created scaled variables:
df_scaled <- df %>%
dplyr::select(-aanestysalue_nro, -aanestysalue_nimi, -Suurinpuolue)
df_scaled <- scale(df_scaled)
summary(df_scaled)
## Kuluyks_keskiarvo2014 Kuluyks_mediaani2014 Gini2014
## Min. :-0.9293 Min. :-1.4327 Min. :-1.35950
## 1st Qu.:-0.5723 1st Qu.:-0.7631 1st Qu.:-0.58224
## Median :-0.2782 Median :-0.2177 Median :-0.30721
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.1678 3rd Qu.: 0.5581 3rd Qu.: 0.07544
## Max. : 6.5700 Max. : 3.6408 Max. : 4.46665
## IP.p KA.p KD.p
## Min. :-1.7239 Min. :-1.14047 Min. :-2.008180
## 1st Qu.:-0.6394 1st Qu.:-0.55916 1st Qu.:-0.775760
## Median :-0.1465 Median :-0.09412 Median :-0.003443
## Mean : 0.0000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.4451 3rd Qu.: 0.37093 3rd Qu.: 0.916764
## Max. : 3.6491 Max. : 3.50998 Max. : 2.099888
## KESK.p KOK.p KTP.p Muut.p
## Min. :-2.03549 Min. :-1.6040 Min. :-0.46749 Min. :-1.4818
## 1st Qu.:-0.86070 1st Qu.:-0.8952 1st Qu.:-0.46749 1st Qu.:-0.5288
## Median : 0.02573 Median :-0.1414 Median :-0.46749 Median :-0.1476
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.81605 3rd Qu.: 0.6429 3rd Qu.: 0.03339 3rd Qu.: 0.4241
## Max. : 3.14427 Max. : 2.5752 Max. : 7.29621 Max. : 4.5537
## Muutos2011.p Piraattip..p PS.p RKP.p
## Min. :-1.2464 Min. :-1.8007 Min. :-1.5670 Min. :-1.1482
## 1st Qu.:-0.6115 1st Qu.:-0.7017 1st Qu.:-0.8205 1st Qu.:-0.6494
## Median :-0.2488 Median :-0.2054 Median :-0.1550 Median :-0.3681
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5675 3rd Qu.: 0.5036 3rd Qu.: 0.6958 3rd Qu.: 0.3411
## Max. : 4.3766 Max. : 3.2687 Max. : 3.1187 Max. : 4.0053
## SDP.p SKP.p STP.p VAS.p
## Min. :-1.89884 Min. :-1.5230 Min. :-0.9094 Min. :-1.6827
## 1st Qu.:-0.76763 1st Qu.:-0.6395 1st Qu.:-0.9094 1st Qu.:-0.6432
## Median : 0.04733 Median :-0.2508 Median :-0.2589 Median :-0.1753
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.75977 3rd Qu.: 0.3854 3rd Qu.: 0.3915 3rd Qu.: 0.3897
## Max. : 2.40011 Max. : 4.1315 Max. : 5.0749 Max. : 3.1706
## VIHR.p Punavih.p Naisia_2015.p
## Min. :-1.8364 Min. :-1.9139 Min. :-2.8622
## 1st Qu.:-0.8216 1st Qu.:-0.6910 1st Qu.:-0.6368
## Median :-0.2037 Median :-0.1629 Median :-0.0323
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7619 3rd Qu.: 0.4799 3rd Qu.: 0.5669
## Max. : 2.5473 Max. : 2.8488 Max. : 2.4981
## Ruotsinkielisia_2015.p Korkeakoulutus2015.p Tyottomat2014.p
## Min. :-0.9937 Min. :-2.47165 Min. :-2.3395
## 1st Qu.:-0.6318 1st Qu.:-0.77483 1st Qu.:-0.7486
## Median :-0.4123 Median : 0.03721 Median :-0.1896
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.2998 3rd Qu.: 0.66051 3rd Qu.: 0.6516
## Max. : 3.7810 Max. : 1.92001 Max. : 2.7128
## SuurituloisinX_2014.p Allemediaanitulot_2014.p Nolla_17.p
## Min. :-1.2163 Min. :-2.100604 Min. :-2.42592
## 1st Qu.:-0.7090 1st Qu.:-0.953722 1st Qu.:-0.64688
## Median :-0.2854 Median :-0.003662 Median :-0.02458
## Mean : 0.0000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.3067 3rd Qu.: 0.698917 3rd Qu.: 0.61207
## Max. : 3.2743 Max. : 1.943795 Max. : 2.69061
## Kahdeksantoista_29.p Kolmekymmenta_44.p Neljakymmentaviisi_64.p
## Min. :-1.7318 Min. :-1.7078 Min. :-1.9778
## 1st Qu.:-0.7426 1st Qu.:-0.8155 1st Qu.:-0.8495
## Median :-0.1344 Median :-0.2032 Median : 0.1590
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5369 3rd Qu.: 0.6459 3rd Qu.: 0.6608
## Max. : 3.1851 Max. : 2.9313 Max. : 2.2434
## Yli_65.p
## Min. :-3.53969
## 1st Qu.:-0.60589
## Median : 0.08151
## Mean : 0.00000
## 3rd Qu.: 0.64087
## Max. : 2.12574
Next I will show the coefficients for the first three PCA components. There we can observe that age groups: 18-29v and 30-44v have the biggest coefficients in absolute terms in PC2 component. This is interesting because Vasemmistoliitto and Vihreät also have highly negatvie coefficients in PC2 component. I will talk more about the components and their relations with variables more after the biplot.
df_scaled <- as.data.frame(df_scaled)
# tama pistaa nimet kuntoon
min <- which(colnames(df_scaled) == "Naisia_2015.p")
max <- which(colnames(df_scaled) == "Yli_65.p")
colnames(df_scaled)[min:max] <- c("Naiset", "Ruotsinkieliset", "Korkeakoulutus", "Työttömyys", "Suurituloisin 10 %", "Pienituloisin 50%", "0-17v", "18-29v", "30-44v", "45-64v", "65v+", "Gini")
colnames(df_scaled)[1:3] <- c("Mediaanitulot", "Keskiarvotulot", "Gini")
# Renaming the variables for readability
colnames(df_scaled)[4:20] <- c("IP", "KA", "KD", "Keskusta", "Kokoomus", "KTP", "Muut", "Muutos2011", "Piraatit", "Perussuomalaiset", "RKP", "SDP", "SKP", "STP", "Vasemmistoliitto", "Vihreät", "VAS+VIHR")
rownames(df_scaled) <- df$aanestysalue_nimi
pca_puolue <- prcomp(df_scaled)
s <- summary(pca_puolue)
s$rotation[,1:3]
## PC1 PC2 PC3
## Mediaanitulot 0.234399509 0.08310528 0.11800972
## Keskiarvotulot 0.254426626 0.12478256 -0.07826694
## Gini 0.234216646 -0.02315811 0.24763613
## IP -0.216130639 0.04254659 0.15039153
## KA -0.183561660 0.03531061 0.16377006
## KD -0.190397143 0.14799140 0.01934255
## Keskusta -0.082342786 0.22135931 -0.19244349
## Kokoomus 0.260279581 0.10642263 -0.03960320
## KTP -0.090437119 0.01860063 0.16705683
## Muut -0.137805898 0.02188363 0.08208792
## Muutos2011 -0.165522083 0.08910878 0.03808296
## Piraatit -0.140084657 -0.25372259 -0.03392582
## Perussuomalaiset -0.225173061 0.18828384 -0.07444517
## RKP 0.220206952 -0.01417146 0.30158756
## SDP -0.246690882 0.13257410 0.05537124
## SKP -0.185226917 -0.06944672 0.04055914
## STP -0.145447276 0.02637016 0.15937758
## Vasemmistoliitto -0.147679616 -0.26729525 -0.03222177
## Vihreät 0.036088345 -0.35476476 -0.05943911
## VAS+VIHR -0.050385016 -0.34965018 -0.05292101
## Naiset -0.008221606 -0.09073054 0.38532431
## Ruotsinkieliset 0.217229666 -0.02401998 0.32701094
## Korkeakoulutus 0.270241546 -0.05353298 -0.07056864
## Työttömyys -0.241113640 0.02624027 0.25031155
## Suurituloisin 10 % 0.263706843 0.06567868 0.08614090
## Pienituloisin 50% -0.252074803 -0.09713782 0.17575416
## 0-17v -0.023470189 0.30424603 -0.21743284
## 18-29v -0.035089973 -0.32855875 -0.01174990
## 30-44v 0.064634805 -0.32171061 -0.15152342
## 45-64v -0.020168730 0.31313394 -0.04101381
## 65v+ 0.032479295 0.10989205 0.47493891
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_puolue, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], xlim=c(-0.25, 0.25), ylim=c(-0.25, 0.25))
title(main = "PCA analysis of Helsinki voting regions")
As we can see from the biplot, there is a clear indication of a red-green bubble (punavihreä kupla). Both Vasemmistoliitto and Vihreät are pointing in the same general direction (highly negative in respect to the second PCA component), but Vasemmistoliitto points points more to the left and Vihreät slightly towards right. If we look at their joint variable VAS+VIHR, it is pointing directly towards the regions of Kallio, Vallilla and Alppila (also Sörnäinen).
We can deduct from the biplot that the second component mainly consits of age group variables and share of women if we are not looking at party variables. More negative the PC2 component, the more young people and women are located in the region. More positive the PC2 component, the more older population there is in the region.
On the other hand, PC1 seems to be strongly linked to socio-economic variables (income, unempolyment and education). We can observe Kokoomus (the right-wing party) pointing towards right (slightly upwards), thus being linked to old age and economic prosperity. On the right slightly downwards direction we have the Swedish speaking people, which seem to live in the same areas where also the share of higher education is high.
On the left side of the biplot we have the social problems of unemployment and share of bellow median income people living in the region. Populist parties (Perussuomalaiset) in combination with the social democrats (SDP) are pointing to that direction.
We can conclude that our first hypothesis was confirmed. There is clear statistical evidence of green-red bubble.
My second research qestion was: What regional characteristics predict, left-wing party, Vasemmistoliitto’s support?
To approach this question I will use LDA (linear discriminant analysis). LDA differs from PCA so that it is supervised learning mehtod, where as PCA is unsupervised. LDA, similarly than PCA, also looks for linear combinations which could explain the data. LDA method seeks to minimize the differences within groups and maximize the differences between groups.
To use LDA we will need a categorical variable. Therefore, I will cut Vasemmistoliitto party’s support into four sections (quantiles): low, med_low, med_high and high. I will only use the background variables as explanatory variables in the model, meaning that all voting results regarding other than Vasemmistoliitto will be excluded from the model.
I will divide my 105 observations into a training set (80% of the observations) and test set (20%). I will use the training set for learning my model and then test set will validate its performance. I will use a biplot to visualize my results.
Next I will perform the analysis. We will notice that the results of the model will not be easily interpretable.
# analysis for the leftwing party
scaled_VAS <- df_scaled[, "Vasemmistoliitto"]
summary(scaled_VAS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6830 -0.6432 -0.1753 0.0000 0.3897 3.1710
vas <- as.numeric(cut_number(scaled_VAS,4)) %>%
factor(labels = c("low", "med_low", "med_high", "high"))
table(vas)
## vas
## low med_low med_high high
## 27 26 26 26
min <- which(colnames(df_scaled) == "IP")
max <- which(colnames(df_scaled) == "VAS+VIHR")
df2_scaled <- dplyr::select(df_scaled, -Vasemmistoliitto, -(min:max)) # vain taustamuuttujat
df2_scaled <- data.frame(df2_scaled, vas)
set.seed(12)
ind <- nrow(df2_scaled) %>%
sample(. * 0.8)
train <- df2_scaled[ind,]
test <- df2_scaled[-ind,]
correct_classes <- test[, "vas"]
# test set ready for prediction
test <- dplyr::select(test, -vas)
library(MASS)
lda.fit <- lda(vas ~. , data = train)
lda.fit
## Call:
## lda(vas ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2500000 0.2380952 0.2619048
##
## Group means:
## Mediaanitulot Keskiarvotulot Gini Naiset
## low 0.8690290171 1.04516318 0.6222789 -0.05064286
## med_low -0.0003225013 0.05519139 0.1507960 0.07698326
## med_high -0.5297775845 -0.70313406 -0.4731595 0.36330136
## high -0.4980659953 -0.62563231 -0.3990084 -0.23218180
## Ruotsinkieliset Korkeakoulutus Työttömyys Suurituloisin.10..
## low 0.96946441 0.8990153 -0.8506727 0.9427153
## med_low 0.08934603 0.1368221 -0.2584241 0.0941561
## med_high -0.44196041 -0.7782810 0.7536965 -0.6358014
## high -0.48638579 -0.4326277 0.4365850 -0.6012617
## Pienituloisin.50. X0.17v X18.29v X30.44v
## low -0.9972227 0.4892705 -0.57066488 -0.2744735731
## med_low -0.1753324 -0.3820600 0.07837955 -0.0008888253
## med_high 0.8160553 0.2203989 0.23196871 -0.2954531008
## high 0.6187012 -0.5526493 0.53028520 0.5532917476
## X45.64v X65v.
## low 0.24306291 0.2304170
## med_low 0.01364614 0.3586761
## med_high -0.05228444 -0.2594919
## high -0.30834241 -0.3112001
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## Mediaanitulot 0.48873625 1.2687191 1.5819576
## Keskiarvotulot -1.48325356 -1.0299530 -0.7587861
## Gini 0.50763285 -1.1953653 -3.5491880
## Naiset 0.41992611 0.7281170 -0.6063731
## Ruotsinkieliset -0.99711563 -0.3226375 0.9815081
## Korkeakoulutus -0.59844039 -0.2592145 1.7460963
## Työttömyys 0.04274597 1.6410836 0.4759254
## Suurituloisin.10.. 0.56013106 1.9459009 2.0710354
## Pienituloisin.50. 0.16366076 -1.2262569 2.1052261
## X0.17v -108.48397045 -55.5743239 117.8974516
## X18.29v -119.35884863 -61.7610059 130.4354907
## X30.44v -84.34260500 -45.4237723 93.6319928
## X45.64v -76.80112764 -41.6697259 85.1539016
## X65v. -86.72455798 -45.9156593 94.1146122
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7561 0.1868 0.0571
library(ggbiplot)
vas_train <- train[,"vas"]
ggbiplot(lda.fit, obs.scale = 1, var.scale = 1,
groups = vas_train, ellipse = TRUE, circle = TRUE, labels = row.names(train)) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top') + ggtitle("Biplot of Vasemmistoliito's support groups")
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 4 2 0 0
## med_low 1 3 1 0
## med_high 0 2 4 0
## high 0 1 1 2
As you can see the results are a mess as the age group variables get very high coefficients ruining the interpretability of the model. The model does not perform that badly, but because the interplay of age groups (one age group increasing its share will make the space smaller for rest of the age groups as they sum up to 1). I will redo the analysis by replacing age variables with a new variable that is a ratio of 18-44 year olds to 45- onwards. This would hopefully solve the issue.
ikasuhde <- df %>%
dplyr::mutate(nuoret_vanhat_suhde = (Kahdeksantoista_29.p + Kolmekymmenta_44.p) / (Neljakymmentaviisi_64.p + Yli_65.p))
ikasuhde <- ikasuhde %>%
dplyr::select(nuoret_vanhat_suhde)
ikasuhde_scaled <- scale(ikasuhde)
min <- which(colnames(df2_scaled) == "X0.17v")
max <- which(colnames(df2_scaled) == "X65v.")
df3_scaled <- df2_scaled %>%
dplyr::select(-(min:max))
df3_scaled["Nuoret-vanhat-suhde"] <- ikasuhde_scaled[,1]
# now redo analysis
set.seed(12)
ind <- nrow(df2_scaled) %>%
sample(. * 0.8)
train <- df3_scaled[ind,]
test <- df3_scaled[-ind,]
correct_classes <- test[, "vas"]
# test set ready for prediction
test <- dplyr::select(test, -vas)
library(MASS)
lda.fit <- lda(vas ~. , data = train)
lda.fit
## Call:
## lda(vas ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2500000 0.2380952 0.2619048
##
## Group means:
## Mediaanitulot Keskiarvotulot Gini Naiset
## low 0.8690290171 1.04516318 0.6222789 -0.05064286
## med_low -0.0003225013 0.05519139 0.1507960 0.07698326
## med_high -0.5297775845 -0.70313406 -0.4731595 0.36330136
## high -0.4980659953 -0.62563231 -0.3990084 -0.23218180
## Ruotsinkieliset Korkeakoulutus Työttömyys Suurituloisin.10..
## low 0.96946441 0.8990153 -0.8506727 0.9427153
## med_low 0.08934603 0.1368221 -0.2584241 0.0941561
## med_high -0.44196041 -0.7782810 0.7536965 -0.6358014
## high -0.48638579 -0.4326277 0.4365850 -0.6012617
## Pienituloisin.50. `Nuoret-vanhat-suhde`
## low -0.9972227 -0.31583562
## med_low -0.1753324 -0.13253126
## med_high 0.8160553 0.04993443
## high 0.6187012 0.42835446
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## Mediaanitulot -0.23121990 1.6182308 0.83179678
## Keskiarvotulot -0.05226718 -0.2543018 -0.01458681
## Gini 1.60409872 -3.6449498 -1.95169677
## Naiset 0.18180171 1.0370667 -1.10309254
## Ruotsinkieliset -0.97399476 0.4805874 0.73548781
## Korkeakoulutus -0.20163417 -1.6883704 2.84990125
## Työttömyys 0.77181436 1.0114733 0.70652848
## Suurituloisin.10.. -1.17154171 3.7195498 0.08768411
## Pienituloisin.50. -0.25284737 -0.2826991 2.10547380
## `Nuoret-vanhat-suhde` 0.61033463 1.3087730 -0.08191412
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7919 0.1365 0.0716
library(ggbiplot)
vas_train <- train[,"vas"]
ggbiplot(lda.fit, obs.scale = 1, var.scale = 1,
groups = vas_train, ellipse = TRUE, circle = TRUE, labels = row.names(train)) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top') + ggtitle("Biplot of Vasemmistoliito's support groups")
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 3 3 0 0
## med_low 1 3 1 0
## med_high 0 1 4 1
## high 0 1 1 2
Now we can see interpret the results much better. The results are a bit different LD1 has now a higher share of the captured trace. The better interpretability comes up with a price of misclassifying one observation more in the updated model than in the first one.
If we look at the second biplot, we can see that Gini is interstingly pointing towards high support for Vasemmistoliitto. If you go to my helsinkiclusters-app you can see that the data clearly contradicts this: higher the gini coeffcient, lower the support for Vasemmistoliitto. Also if we inspect the group means we can see clear negative relation to high and med_high support with gini coefficient. Share of women is difficult to interpret as high share of women is connected to med_high support, but not to high support. This means that Vasemmistoliitto’s voter base is diverse.
If we look at the biplot, we can see that there is a signifficant overlap between med_low, med_high and high support groups. Therefore, it is very difficult to create simple rules of thumb to say which variable leads certainly to high support for Vasemmistoliitto. We can say which have a negative effect more easily. The share of Swedish speaking people and high share of people belonging into the highest decile in income distribution certainly affect negatively to Vasemmistoliitto’s support. Interestingly we can find evidence that high share of small income people also predict diminished support for Vasemmistoliitto (becuse SDP and Perussuomalaiset are high on those regions).
We can conclude that our hypothesis for the first research question was confirmed. There is clear statistical evidence of green-red bubble in Kallio, Sörnäinen and Vallilla with some other nearby regions. These regions are mainly inhabited by young people and high share of women.
For the second research question it is hard to give a simple answer, but we can see from the data that high levels of income, education and share of Swedish speaking population affect negatively to Vasemmistoliitto’s support. Higher support is linked to unemployment, high ratio of young people over old in the region. Vasemmistoliitto appears to have a diverse voter as most left-wing parties have: the working class wing (mainly old working people with little education) and academic wing (mainly young). Both groups wings are joined by low income. With variables in the dataset Vasemmistoliitto’s support is not very easily easily classifiable. The accuracy might be better if there was a variable indicating students attending higher levels of education (not just completed higher levels of education).